Goto

Collaborating Authors

 spanish and english


Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Neural Information Processing Systems

We present an approach to encode a speech signal into a fixed-size representation which minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Sentences are close in this embedding space, independently of their language and modality, either text or audio. Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yielded more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several languages pairs.


Multimodal and Multilingual Embeddings for Large-Scale Speech Mining

Neural Information Processing Systems

We present an approach to encode a speech signal into a fixed-size representation which minimizes the cosine loss with the existing massively multilingual LASER text embedding space. Sentences are close in this embedding space, independently of their language and modality, either text or audio. Using a similarity metric in that multimodal embedding space, we perform mining of audio in German, French, Spanish and English from Librivox against billions of sentences from Common Crawl. This yielded more than twenty thousand hours of aligned speech translations. To evaluate the automatically mined speech/text corpora, we train neural speech translation systems for several languages pairs.


Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

arXiv.org Artificial Intelligence

Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study, we investigate how pre-trained Language Models handle code-switched text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the abilities of these models to generalise representations to CS corpora.


Babies who grow up bilingual have better problem-solving skills before they can talk

Daily Mail - Science & tech

Learning a second language when you are young has long been known to boost brainpower. Now researchers have found that the brains of babies exposed to two languages benefit from this extra boost even before they can utter a word. Scientists claim that just growing up in a home or environment where they are listening to more than one language being spoken could improve a child's problem solving skills and memory. Researchers have found that the brains of babies exposed to two languages develop better, even before they can utter a word. They claim that just growing up in a home or environment where they are listening to more than one language being spoken could improve a child's problem solving skills and memory Previous studies suggest that speaking two or more languages from a very young age helps a child's development into adults with more highly refined cognitive skills.